-
Notifications
You must be signed in to change notification settings - Fork 423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
multi-node distributed training, submitit & composer integration demo #1753
Conversation
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
@YilunKuang thanks for opening this PR! Just wanted to confirm this is ready for review / we can take a look at it? |
@mvpatel2000 Yes, this is ready for review. The only change I made is I added another Jupiter notebook as a guide to do distributed training with composer without the composer launcher and I didn't change the rest of the codebase. If this guide is accepted, I would recommend adding a reference to this guide in the Distributed Training doc https://docs.mosaicml.com/en/stable/notes/distributed_training.html |
Hi @YilunKuang this is great thanks for submitting this PR with an example for Have you tested this notebook out on your end? If so could you please post some screen shots of the output a user is expected to see in the description of this PR? We can use that as a stop gap for any users interested in leveraging this example as well as evidence the that notebook code is functional, at least at the time of this submission. |
@bandish-shah Sure no problem. I just thought this might be helpful for someone else working with the SLURM system. Here is a screenshot of the output: These outputs are not generated directly from the Jupyter notebook but are generated by running the last code block in the Jupyter notebook as a python file. Also, I ran my script in a multi-node setting (2 nodes, each with 4 GPUs) so there are in total 8 output files from SLURM. Let me know and I can send over the whole zip file containing output from 8 GPUs. |
@YilunKuang it's very helpful as we've had others ask questions around how to use Composer with SLURM. Thank you again for doing this! |
@YilunKuang it looks like there's a few minor CI issues that need to be addressed before we can let this through.
|
@bandish-shah Glad to help! For 1, it seems like I cannot access the notion page. I got a message saying "You do not have access to MosaicML. Please contact an admin to add you as a member". For 2, thanks I will try to add the additional changes to the pull request these two days |
Typo on my end, meant to post a link to our style guide which details how to run pre-commit hooks locally, not a notion page 😅. URL fixed in the original comment. |
@bandish-shah Gotcha thanks! Will push my changes this weekend. |
@bandish-shah Just checked the two things you mentioned! My last commit (db9a8b9) seems to pass all the test, but when I manually merged the latest change from the |
@YilunKuang Looks like it was a transient error :) I merged dev again and it looks like everything is passing, thanks! |
@dakinggg Thanks for looking into it! Feel free to close this pull request. |
What does this PR do?
This PR adds a Jupyter notebook demo of how to submit a multi-node distributed training job on the SLURM cluster without using the composer launcher. Specifically, I show how to use submitit with composer in the multi-node distributed training setting. submitit automates SLURM cluster job submission in python scripts without any shell scripts, and this implementation shows how to properly set up the distributed training environment variables in python.
What issue(s) does this change relate to?
It's related to some of the requests posted on the MosaicML slack channel.
Before submitting
pre-commit
on your change? (see thepre-commit
section of prerequisites)